#read data
rm(list=ls())
setwd("~/Documents/UCI/Customer and Social Analytics/Midterm")
highnode = read.csv("HighNote Data Midterm.csv")

Summary statistics

library(psych)
library(dplyr)
options(digits = 5)
options(scipen=3)

#Overall summary statistics
describe(highnode[-1])
#Summary statistics for adopter
describe(filter(highnode, adopter == 1)[-1])
#Summary statistics for non-adopter
describe(filter(highnode, adopter == 0)[-1])
#Compare means between the adopter and non-adopter subsamples
for (i in 1:15) {
  
  #skip the adopter column itself
  if (i == 13) next
  
  print(names(highnode[i + 1]))
  print(t.test(highnode[, i + 1] ~ highnode$adopter))
}
## [1] "age"
## 
##  Welch Two Sample t-test
## 
## data:  highnode[, i + 1] by highnode$adopter
## t = -17, df = 4080, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.2658 -1.7971
## sample estimates:
## mean in group 0 mean in group 1 
##          23.948          25.980 
## 
## [1] "male"
## 
##  Welch Two Sample t-test
## 
## data:  highnode[, i + 1] by highnode$adopter
## t = -13.7, df = 4300, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.122787 -0.091954
## sample estimates:
## mean in group 0 mean in group 1 
##         0.62186         0.72923 
## 
## [1] "friend_cnt"
## 
##  Welch Two Sample t-test
## 
## data:  highnode[, i + 1] by highnode$adopter
## t = -10.6, df = 3680, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -25.154 -17.330
## sample estimates:
## mean in group 0 mean in group 1 
##          18.492          39.734 
## 
## [1] "avg_friend_age"
## 
##  Welch Two Sample t-test
## 
## data:  highnode[, i + 1] by highnode$adopter
## t = -15.7, df = 4140, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.6089 -1.2509
## sample estimates:
## mean in group 0 mean in group 1 
##          24.011          25.441 
## 
## [1] "avg_friend_male"
## 
##  Welch Two Sample t-test
## 
## data:  highnode[, i + 1] by highnode$adopter
## t = -4.44, df = 4590, p-value = 0.0000091
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.02884 -0.01118
## sample estimates:
## mean in group 0 mean in group 1 
##         0.61659         0.63660 
## 
## [1] "friend_country_cnt"
## 
##  Welch Two Sample t-test
## 
## data:  highnode[, i + 1] by highnode$adopter
## t = -21.3, df = 3790, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.5288 -2.9331
## sample estimates:
## mean in group 0 mean in group 1 
##          3.9579          7.1888 
## 
## [1] "subscriber_friend_cnt"
## 
##  Welch Two Sample t-test
## 
## data:  highnode[, i + 1] by highnode$adopter
## t = -12.3, df = 3630, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.4139 -1.0248
## sample estimates:
## mean in group 0 mean in group 1 
##         0.41747         1.63680 
## 
## [1] "songsListened"
## 
##  Welch Two Sample t-test
## 
## data:  highnode[, i + 1] by highnode$adopter
## t = -21.6, df = 3790, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -17634 -14703
## sample estimates:
## mean in group 0 mean in group 1 
##           17589           33758 
## 
## [1] "lovedTracks"
## 
##  Welch Two Sample t-test
## 
## data:  highnode[, i + 1] by highnode$adopter
## t = -21.2, df = 3710, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -193.94 -161.09
## sample estimates:
## mean in group 0 mean in group 1 
##          86.823         264.341 
## 
## [1] "posts"
## 
##  Welch Two Sample t-test
## 
## data:  highnode[, i + 1] by highnode$adopter
## t = -4.22, df = 3660, p-value = 0.000026
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -23.3067  -8.5082
## sample estimates:
## mean in group 0 mean in group 1 
##           5.293          21.200 
## 
## [1] "playlists"
## 
##  Welch Two Sample t-test
## 
## data:  highnode[, i + 1] by highnode$adopter
## t = -8.08, df = 3630, p-value = 8.6e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.43676 -0.26621
## sample estimates:
## mean in group 0 mean in group 1 
##         0.54928         0.90077 
## 
## [1] "shouts"
## 
##  Welch Two Sample t-test
## 
## data:  highnode[, i + 1] by highnode$adopter
## t = -3.57, df = 3540, p-value = 0.00037
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -107.662  -31.272
## sample estimates:
## mean in group 0 mean in group 1 
##          29.973          99.440 
## 
## [1] "tenure"
## 
##  Welch Two Sample t-test
## 
## data:  highnode[, i + 1] by highnode$adopter
## t = -5.04, df = 4150, p-value = 0.00000048
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.4626 -1.0840
## sample estimates:
## mean in group 0 mean in group 1 
##          43.810          45.583 
## 
## [1] "good_country"
## 
##  Welch Two Sample t-test
## 
## data:  highnode[, i + 1] by highnode$adopter
## t = 8.8, df = 4250, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.054636 0.085954
## sample estimates:
## mean in group 0 mean in group 1 
##         0.35779         0.28750

Tentative conclusions

After comparing the adopter and non-adopter subsamples, it can be concluded that, on average, the adopter group:

  • is older (by about 2 years), more likely male (by about 10 percentage points), and less likely to be from the US, UK, or Germany (by about 7 percentage points)

  • has more friends (about 2x), whose average age is higher (by about 1 year), who come from more countries (about 2x), and who are more likely to be subscribers (about 4x)

  • has spent longer on the site (about 2 more months), listened to more songs (about 2x), loved more tracks (about 3x), and has more posts (about 4x), playlists (about 2x), and shouts (about 3x)


Data Visualization

Demographics

The age distributions of non-adopters and adopters are very similar, but the average age of adopters is slightly higher, and the proportion of males is larger in the adopter group. Demographically, the typical adopter is therefore a slightly older male.

library(ggplot2)
library(gridExtra)
highnode$adopter = factor(highnode$adopter, labels = c("non-adopter","adopter"))

#age histogram
g1 = ggplot(highnode, aes(x = age))+
    geom_histogram(aes(fill=adopter), binwidth = 1)+
    labs(title="Age Histogram", 
        subtitle="bin_width = 1", 
        fill="", 
        caption="Source: HighNode")  

#age density
g2 = ggplot(highnode, aes(x = age))+
    geom_density(aes(fill=adopter), alpha=0.6)+
    labs(title="Age Density", 
        fill="", 
        caption="Source: HighNode")  

#gender bar
highnode$male = factor(highnode$male, labels = c("female","male"))

g3 = ggplot(filter(highnode, adopter=="adopter"), aes(adopter) )+
    geom_bar(aes(fill=male), width = 0.3)+
    labs(title="Gender in adopters", 
        fill="", 
        caption="Source: HighNode")+
    xlab(" ")

g4 = ggplot(filter(highnode, adopter=="non-adopter"), aes(adopter) )+
    geom_bar(aes(fill=male), width = 0.3)+
    labs(title="Gender in non-adopters", 
        fill="", 
        caption="Source: HighNode")+
    xlab(" ")

#plot
grid.arrange(g1,g2,g3,g4,
             top = "Demographics")

Peer influence

If the friend subscription rate is defined as subscriber_friend_cnt/friend_cnt, the rate for the adopter group is more than twice that of the non-adopter group, which means a friend of an adopter is more likely to be a subscriber. Also, the relationship between the number of friends and the number of friends who are premium subscribers is close to linear: there is no significant drop-off as the number of friends grows. As noted in the summary statistics, adopters also have more friends overall. In short, peer influence does appear to exist. In addition, the t-tests above show that the adopter group has more friends, with a higher average age, from more countries, who are more likely to be subscribers.

#Friend_cnt vs Subscriber_friend_cnt
g5 = ggplot(highnode, aes(x = friend_cnt, y = subscriber_friend_cnt))+
  geom_point(aes(color = adopter), size = 1, alpha = 0.8)+
  geom_smooth(aes(color = adopter), se = F)+
  coord_cartesian(xlim=c(0, 2000), ylim=c(0, 75)) + 
  labs(title="Friend_cnt vs Subscriber_friend_cnt",
       color = " ", 
       caption="Source: HighNode")

#subscription rate
highnode = mutate(highnode, subscription_rate = subscriber_friend_cnt/friend_cnt)
adopter_subscription_rate = mean(highnode$subscription_rate[highnode$adopter == "adopter"])
nonadopter_subscription_rate = mean(highnode$subscription_rate[highnode$adopter == "non-adopter"])

g6 = ggplot(highnode, aes(x = ID, y = subscription_rate))+
  geom_point(aes(color = adopter), size = 0.7, alpha = 0.8)+
  labs(title="Subscription rate of friends",
       color = " ", 
       y = "Rate",
       caption="Source: HighNode")

g7 = ggplot(highnode, aes(x = adopter, y = subscription_rate)) + 
  geom_bar(stat = "summary", fun.y = "mean", width = 0.3, fill = "seagreen")

highnode$subscription_rate = NULL

#plot
grid.arrange(g6,g7,g5,
             layout_matrix = rbind(c(1,1),
                                   c(2,3)),
             top = "Peer influence")

User engagement

On every measure of user engagement, the adopter group scores higher than the non-adopter group. A plausible reading is that the more a user engages with the site, the more likely he or she is to become an adopter.

#songsListened
g8 = ggplot(highnode, aes(x = adopter, y = songsListened)) + 
  geom_bar(stat = "summary", fun.y = "mean", width = 0.3)+
  labs(x = " ")

#lovedTracks
g9 = ggplot(highnode, aes(x = adopter, y = lovedTracks)) + 
  geom_bar(stat = "summary", fun.y = "mean", width = 0.3)+
  labs(x =" ")

#posts
g10 = ggplot(highnode, aes(x = adopter, y = posts)) + 
  geom_bar(stat = "summary", fun.y = "mean", width = 0.3)+
  labs(x = "Group")

#playlists
g11 = ggplot(highnode, aes(x = adopter, y = playlists)) + 
  geom_bar(stat = "summary", fun.y = "mean", width = 0.3)+
  labs(x = " ")
  
#shouts
g12 = ggplot(highnode, aes(x = adopter, y = shouts)) + 
  geom_bar(stat = "summary", fun.y = "mean", width = 0.3)+
  labs(x = " ")

g13 = ggplot(highnode, aes(x = adopter, y = tenure)) + 
  geom_bar(stat = "summary", fun.y = "mean", width = 0.3)+
  labs(x = " ")
  
#plots
grid.arrange(g8,g9,g10,g11,g12,g13,nrow = 1,top = "User engagement", bottom = "Source: HighNode")


Propensity Score Matching (PSM)

Propensity score estimation

#create a variable to represent treatment
highnode = mutate(highnode, treat = ifelse(subscriber_friend_cnt>=1,1,0))

#test the difference before PSM (adopter is a factor, so subtract 1 to recover 0/1)
t.test((as.integer(highnode$adopter)-1)~highnode$treat)
## 
##  Welch Two Sample t-test
## 
## data:  (as.integer(highnode$adopter) - 1) by highnode$treat
## t = -31, df = 11800, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.13303 -0.11719
## sample estimates:
## mean in group 0 mean in group 1 
##        0.052435        0.177543
#logistic regression
m_ps <- glm(treat ~ age + male + friend_cnt + avg_friend_age + avg_friend_male + friend_country_cnt 
            + songsListened + lovedTracks + posts + playlists + shouts + tenure + good_country,
            family = binomial(), data = highnode)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(m_ps)
## 
## Call:
## glm(formula = treat ~ age + male + friend_cnt + avg_friend_age + 
##     avg_friend_male + friend_country_cnt + songsListened + lovedTracks + 
##     posts + playlists + shouts + tenure + good_country, family = binomial(), 
##     data = highnode)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -4.421  -0.567  -0.422  -0.300   2.562  
## 
## Coefficients:
##                        Estimate   Std. Error z value Pr(>|z|)    
## (Intercept)        -5.143967704  0.077025853  -66.78  < 2e-16 ***
## age                 0.019697667  0.002807995    7.01  2.3e-12 ***
## malemale            0.043114235  0.029980958    1.44  0.15042    
## friend_cnt          0.031321849  0.001033703   30.30  < 2e-16 ***
## avg_friend_age      0.079550730  0.003481500   22.85  < 2e-16 ***
## avg_friend_male     0.251388142  0.050285006    5.00  5.8e-07 ***
## friend_country_cnt  0.111034819  0.004764945   23.30  < 2e-16 ***
## songsListened       0.000006906  0.000000516   13.40  < 2e-16 ***
## lovedTracks         0.000667065  0.000056452   11.82  < 2e-16 ***
## posts               0.000569908  0.000268232    2.12  0.03361 *  
## playlists           0.005638941  0.011897559    0.47  0.63553    
## shouts             -0.000049085  0.000037068   -1.32  0.18543    
## tenure             -0.002571113  0.000776896   -3.31  0.00093 ***
## good_country        0.032012002  0.029217524    1.10  0.27323    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 46640  on 43826  degrees of freedom
## Residual deviance: 34170  on 43813  degrees of freedom
## AIC: 34198
## 
## Number of Fisher Scoring iterations: 7
#predict PSM value
prs_df <- data.frame(pr_score = predict(m_ps, type = "response"),
                     treat = m_ps$model$treat)
head(prs_df)
#Examining the region of common support
labs <- paste("Actual treatment type:", c("having subscriber friends", "no subscriber friends"))
prs_df %>%
  mutate(treat = ifelse(treat == 1, labs[1], labs[2])) %>%
  ggplot(aes(x = pr_score)) +
  geom_histogram(color = "white") +
  facet_wrap(~treat) +
  xlab("Probability of having subscriber friends") +
  theme_bw()

Executing a matching algorithm

#There is no missing value
summary(is.na(highnode))
##      ID             age             male         friend_cnt     
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:43827     FALSE:43827     FALSE:43827     FALSE:43827    
##  NA's :0         NA's :0         NA's :0         NA's :0        
##  avg_friend_age  avg_friend_male friend_country_cnt subscriber_friend_cnt
##  Mode :logical   Mode :logical   Mode :logical      Mode :logical        
##  FALSE:43827     FALSE:43827     FALSE:43827        FALSE:43827          
##  NA's :0         NA's :0         NA's :0            NA's :0              
##  songsListened   lovedTracks       posts         playlists      
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:43827     FALSE:43827     FALSE:43827     FALSE:43827    
##  NA's :0         NA's :0         NA's :0         NA's :0        
##    shouts         adopter          tenure        good_country   
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:43827     FALSE:43827     FALSE:43827     FALSE:43827    
##  NA's :0         NA's :0         NA's :0         NA's :0        
##    treat        
##  Mode :logical  
##  FALSE:43827    
##  NA's :0
#perform the "nearest" matching algorithm
library(MatchIt)
mod_match <- matchit(treat ~ age + male + friend_cnt + avg_friend_age + avg_friend_male + friend_country_cnt 
                     + songsListened + lovedTracks + posts + playlists + shouts + tenure + good_country,
                     method = "nearest", data = highnode)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#summary(mod_match)
#plot(mod_match)

#create a dataframe containing only the matched observations
highnode_match <- match.data(mod_match)
dim(highnode_match)
## [1] 19646    19

Examining covariate balance in the matched sample

From the visual inspection and the difference-in-means tests, the matching result is good but not perfect: 8 of the 13 covariates are well matched (age, male, avg_friend_age, avg_friend_male, songsListened, playlists, tenure, and good_country), while the remaining 5 (friend_cnt, friend_country_cnt, lovedTracks, posts, and shouts) are matched less well.

highnode_match$treat = as.factor(highnode_match$treat)

#Visual inspection
#create a plotting function
fn_bal <- function(dta, variable) {
  dta$variable <- dta[, variable]
  support <- c(min(dta$variable), max(dta$variable))
  ggplot(dta, aes(x = distance, y = variable, color = treat)) +
    geom_point(alpha = 0.2, size = 1.3) +
    geom_smooth(method = "loess", se = FALSE) +
    xlab("Propensity score") +
    ylab(variable) +
    theme_bw() +
    ylim(support)
}

#plot
library(gridExtra)

#male is a factor here, so as.integer yields 1/2 codes for plotting
highnode_match$male = as.integer(highnode_match$male)
grid.arrange(
   fn_bal(highnode_match, "age") + theme(legend.position = "none"),
   fn_bal(highnode_match, "male"),
   top = "Visual inspection", 
   nrow = 1
)

grid.arrange(
   fn_bal(highnode_match, "friend_cnt") + theme(legend.position = "none"),
   fn_bal(highnode_match, "avg_friend_age"),
   top = "Visual inspection", 
   nrow = 1
)

grid.arrange(
   fn_bal(highnode_match, "avg_friend_male") + theme(legend.position = "none"),
   fn_bal(highnode_match, "friend_country_cnt"),
   top = "Visual inspection", 
   nrow = 1
)

grid.arrange(
   fn_bal(highnode_match, "songsListened") + theme(legend.position = "none"),
   fn_bal(highnode_match, "lovedTracks"),
   top = "Visual inspection", 
   nrow = 1
)

grid.arrange(
   fn_bal(highnode_match, "posts") + theme(legend.position = "none"),
   fn_bal(highnode_match, "playlists"),
   top = "Visual inspection", 
   nrow = 1
)
## Warning: Removed 4 rows containing missing values (geom_smooth).

grid.arrange(
   fn_bal(highnode_match, "shouts") + theme(legend.position = "none"),
   fn_bal(highnode_match, "tenure"),
   top = "Visual inspection", 
   nrow = 1
)

grid.arrange(
   fn_bal(highnode_match, "good_country") + theme(legend.position = "none"),
   top = "Visual inspection", 
   nrow = 1
)

#Difference-in-means
#drop ID, subscriber_friend_cnt, adopter, treat, distance, and weights
highnode_name = names(highnode_match)[c(-1,-8,-14,-17,-18,-19)]

#summary
highnode_match %>%
  group_by(treat) %>%
  select(one_of(highnode_name)) %>%
  summarise_all(mean)
#t test
for (i in 1:13) {
  print(highnode_name[i])
  print(t.test(highnode_match[,highnode_name[i]]~highnode_match$treat))
}
## [1] "age"
## 
##  Welch Two Sample t-test
## 
## data:  highnode_match[, highnode_name[i]] by highnode_match$treat
## t = 9.02, df = 19300, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.75075 1.16761
## sample estimates:
## mean in group 0 mean in group 1 
##          26.332          25.373 
## 
## [1] "male"
## 
##  Welch Two Sample t-test
## 
## data:  highnode_match[, highnode_name[i]] by highnode_match$treat
## t = 3.14, df = 19600, p-value = 0.0017
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.0080145 0.0347423
## sample estimates:
## mean in group 0 mean in group 1 
##          1.6576          1.6363 
## 
## [1] "friend_cnt"
## 
##  Welch Two Sample t-test
## 
## data:  highnode_match[, highnode_name[i]] by highnode_match$treat
## t = -24.8, df = 10500, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -35.127 -29.982
## sample estimates:
## mean in group 0 mean in group 1 
##          21.467          54.021 
## 
## [1] "avg_friend_age"
## 
##  Welch Two Sample t-test
## 
## data:  highnode_match[, highnode_name[i]] by highnode_match$treat
## t = 13.6, df = 18400, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.99898 1.33462
## sample estimates:
## mean in group 0 mean in group 1 
##          26.557          25.390 
## 
## [1] "avg_friend_male"
## 
##  Welch Two Sample t-test
## 
## data:  highnode_match[, highnode_name[i]] by highnode_match$treat
## t = 5.47, df = 19300, p-value = 4.5e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.012411 0.026264
## sample estimates:
## mean in group 0 mean in group 1 
##         0.65515         0.63581 
## 
## [1] "friend_country_cnt"
## 
##  Welch Two Sample t-test
## 
## data:  highnode_match[, highnode_name[i]] by highnode_match$treat
## t = -38.6, df = 13900, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4.5125 -4.0759
## sample estimates:
## mean in group 0 mean in group 1 
##          5.0914          9.3856 
## 
## [1] "songsListened"
## 
##  Welch Two Sample t-test
## 
## data:  highnode_match[, highnode_name[i]] by highnode_match$treat
## t = -11.4, df = 18500, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7472.4 -5277.1
## sample estimates:
## mean in group 0 mean in group 1 
##           27361           33736 
## 
## [1] "lovedTracks"
## 
##  Welch Two Sample t-test
## 
## data:  highnode_match[, highnode_name[i]] by highnode_match$treat
## t = -15.5, df = 16100, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -102.314  -79.327
## sample estimates:
## mean in group 0 mean in group 1 
##          134.54          225.36 
## 
## [1] "posts"
## 
##  Welch Two Sample t-test
## 
## data:  highnode_match[, highnode_name[i]] by highnode_match$treat
## t = -5.68, df = 11000, p-value = 1.4e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -19.1640  -9.3273
## sample estimates:
## mean in group 0 mean in group 1 
##          6.2773         20.5230 
## 
## [1] "playlists"
## 
##  Welch Two Sample t-test
## 
## data:  highnode_match[, highnode_name[i]] by highnode_match$treat
## t = -2.95, df = 17800, p-value = 0.0032
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.119413 -0.024128
## sample estimates:
## mean in group 0 mean in group 1 
##         0.67230         0.74407 
## 
## [1] "shouts"
## 
##  Welch Two Sample t-test
## 
## data:  highnode_match[, highnode_name[i]] by highnode_match$treat
## t = -8.51, df = 10500, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -79.465 -49.702
## sample estimates:
## mean in group 0 mean in group 1 
##          37.236         101.820 
## 
## [1] "tenure"
## 
##  Welch Two Sample t-test
## 
## data:  highnode_match[, highnode_name[i]] by highnode_match$treat
## t = 4.16, df = 19600, p-value = 0.000033
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.61022 1.70007
## sample estimates:
## mean in group 0 mean in group 1 
##          47.704          46.549 
## 
## [1] "good_country"
## 
##  Welch Two Sample t-test
## 
## data:  highnode_match[, highnode_name[i]] by highnode_match$treat
## t = 2.18, df = 19600, p-value = 0.029
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.0015177 0.0282084
## sample estimates:
## mean in group 0 mean in group 1 
##         0.35814         0.34328
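As a complement to the per-covariate t-tests above, covariate balance is often summarized with the standardized mean difference (SMD), which is insensitive to sample size. A minimal base-R sketch on synthetic data (the data and the 0.1 threshold are illustrative, not from the HighNote sample):

```r
#standardized mean difference: |mean_treated - mean_control| / pooled SD
#a common rule of thumb treats SMD < 0.1 as well balanced
smd <- function(x, treat) {
  m1 <- mean(x[treat == 1]); m0 <- mean(x[treat == 0])
  s  <- sqrt((var(x[treat == 1]) + var(x[treat == 0])) / 2)
  abs(m1 - m0) / s
}

set.seed(1)
treat <- rep(c(0, 1), each = 100)
balanced   <- rnorm(200, mean = 25, sd = 5)             #same distribution in both groups
imbalanced <- rnorm(200, mean = 25 + 4 * treat, sd = 5) #treated mean shifted upward

smd(balanced, treat)   #small: both groups drawn from the same distribution
smd(imbalanced, treat) #large: a shift of 4 on sd 5 gives an SMD around 0.8
```

Applied to highnode_match, the same function would flag the covariates where the t-tests above remain significant.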

Estimating treatment effects

In the t-test on the matched data, the mean difference is still significant: about 17.8% of users in the treatment group are adopters, significantly more than the roughly 8.6% in the control group (a control-group rate that is itself higher than before matching). The result is also robust in regressions both with and without covariates: the coefficient on treat is significant in both models.

Since the treatment effect remains significant after controlling for all the covariates, it is reasonable to claim a causal relationship between having subscriber friends and the likelihood of adopting. Specifically, having subscriber friends increases the likelihood of becoming an adopter.
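Because the outcome model is logistic, the size of the treat effect can be expressed as an odds ratio by exponentiating its coefficient. A quick sketch using the estimate reported in the covariate-adjusted model below (in practice one would pull it directly with coef(fit_test2)["treat1"]):

```r
#coefficient on treat1 from the covariate-adjusted logistic regression
b_treat <- 0.72927538

#odds ratio: having subscriber friends roughly doubles the odds of adopting
odds_ratio <- exp(b_treat)
round(odds_ratio, 2)  # ~2.07
```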

#t test
with(highnode_match, t.test( (as.integer(adopter)-1) ~ treat) )
## 
##  Welch Two Sample t-test
## 
## data:  (as.integer(adopter) - 1) by treat
## t = -18.9, df = 18100, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.100094 -0.081317
## sample estimates:
## mean in group 0 mean in group 1 
##        0.086837        0.177543
#regression without covariates
fit_test1 = glm(adopter~treat, data = highnode_match, family = binomial())
summary(fit_test1)
## 
## Call:
## glm(formula = adopter ~ treat, family = binomial(), data = highnode_match)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.625  -0.625  -0.426  -0.426   2.211  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3529     0.0358   -65.7   <2e-16 ***
## treat1        0.8198     0.0445    18.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15345  on 19645  degrees of freedom
## Residual deviance: 14986  on 19644  degrees of freedom
## AIC: 14990
## 
## Number of Fisher Scoring iterations: 5
#regression with covariates
fit_test2 = glm(adopter~treat + age + male + friend_cnt + avg_friend_age + avg_friend_male + friend_country_cnt 
                     + songsListened + lovedTracks + posts + playlists + shouts + tenure + good_country, 
                data = highnode_match, family = binomial())
summary(fit_test2)
## 
## Call:
## glm(formula = adopter ~ treat + age + male + friend_cnt + avg_friend_age + 
##     avg_friend_male + friend_country_cnt + songsListened + lovedTracks + 
##     posts + playlists + shouts + tenure + good_country, family = binomial(), 
##     data = highnode_match)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -3.224  -0.567  -0.456  -0.370   2.526  
## 
## Coefficients:
##                       Estimate  Std. Error z value Pr(>|z|)    
## (Intercept)        -3.67493445  0.14592868  -25.18  < 2e-16 ***
## treat1              0.72927538  0.04681530   15.58  < 2e-16 ***
## age                 0.01419457  0.00406911    3.49  0.00049 ***
## male                0.30396597  0.04899092    6.20  5.5e-10 ***
## friend_cnt         -0.00020016  0.00027927   -0.72  0.47355    
## avg_friend_age      0.01304253  0.00534905    2.44  0.01476 *  
## avg_friend_male     0.06006885  0.09252467    0.65  0.51620    
## friend_country_cnt  0.00727747  0.00364861    1.99  0.04609 *  
## songsListened       0.00000425  0.00000053    8.03  1.0e-15 ***
## lovedTracks         0.00052108  0.00004692   11.11  < 2e-16 ***
## posts               0.00011861  0.00008891    1.33  0.18221    
## playlists           0.04464860  0.01194440    3.74  0.00019 ***
## shouts              0.00011195  0.00007454    1.50  0.13316    
## tenure             -0.00243354  0.00121709   -2.00  0.04556 *  
## good_country       -0.36949176  0.04801809   -7.69  1.4e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 15345  on 19645  degrees of freedom
## Residual deviance: 14493  on 19631  degrees of freedom
## AIC: 14523
## 
## Number of Fisher Scoring iterations: 5

Regression Analyses

First attempt

Using all variables in the regression model, the summary shows that avg_friend_male, posts, and shouts are not significant.

#data preprocessing
highnode = read.csv("HighNote Data Midterm.csv")
#convert male, adopter, and good_country to factors
categorical_col_name = names(highnode)[c(3,14,16)]
for (i in 1:3) {
  highnode[,categorical_col_name[i]] = as.factor(highnode[,categorical_col_name[i]])
}

#put all variables into the regression
fit1 = glm(adopter ~ age + male + friend_cnt + avg_friend_age + avg_friend_male + friend_country_cnt + 
             subscriber_friend_cnt + songsListened + lovedTracks + posts + playlists + shouts + tenure + 
             good_country, 
           data = highnode, family = binomial())
summary(fit1)
## 
## Call:
## glm(formula = adopter ~ age + male + friend_cnt + avg_friend_age + 
##     avg_friend_male + friend_country_cnt + subscriber_friend_cnt + 
##     songsListened + lovedTracks + posts + playlists + shouts + 
##     tenure + good_country, family = binomial(), data = highnode)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -5.353  -0.411  -0.350  -0.291   2.702  
## 
## Coefficients:
##                           Estimate   Std. Error z value Pr(>|z|)    
## (Intercept)           -4.179243363  0.095711116  -43.67  < 2e-16 ***
## age                    0.019617836  0.003477727    5.64  1.7e-08 ***
## male1                  0.413269630  0.041690899    9.91  < 2e-16 ***
## friend_cnt            -0.004312141  0.000491970   -8.77  < 2e-16 ***
## avg_friend_age         0.029539617  0.004483627    6.59  4.4e-11 ***
## avg_friend_male        0.116206131  0.063462474    1.83    0.067 .  
## friend_country_cnt     0.043258567  0.003616344   11.96  < 2e-16 ***
## subscriber_friend_cnt  0.091319891  0.010728991    8.51  < 2e-16 ***
## songsListened          0.000007626  0.000000519   14.69  < 2e-16 ***
## lovedTracks            0.000695027  0.000049334   14.09  < 2e-16 ***
## posts                  0.000084922  0.000095800    0.89    0.375    
## playlists              0.059201112  0.013331442    4.44  9.0e-06 ***
## shouts                 0.000110769  0.000084280    1.31    0.189    
## tenure                -0.004476042  0.001021990   -4.38  1.2e-05 ***
## good_country1         -0.415225836  0.040783753  -10.18  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24537  on 43826  degrees of freedom
## Residual deviance: 22613  on 43812  degrees of freedom
## AIC: 22643
## 
## Number of Fisher Scoring iterations: 5

Model improvement

After deleting the non-significant variables, a chi-squared test shows no significant difference between the full model and the simpler model, so the simpler model is preferred for its parsimony. In addition, there is no multicollinearity problem, since the VIFs of all variables are under 5, a common rule of thumb.

#delete non-significant variables
fit2 = glm(adopter ~ age + male + friend_cnt + avg_friend_age + friend_country_cnt + avg_friend_male +
             subscriber_friend_cnt + songsListened + lovedTracks + playlists + tenure + good_country, 
           data = highnode, family = binomial())
summary(fit2)
## 
## Call:
## glm(formula = adopter ~ age + male + friend_cnt + avg_friend_age + 
##     friend_country_cnt + avg_friend_male + subscriber_friend_cnt + 
##     songsListened + lovedTracks + playlists + tenure + good_country, 
##     family = binomial(), data = highnode)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -5.390  -0.411  -0.350  -0.291   2.702  
## 
## Coefficients:
##                           Estimate   Std. Error z value Pr(>|z|)    
## (Intercept)           -4.178014173  0.095742066  -43.64  < 2e-16 ***
## age                    0.019452737  0.003479446    5.59  2.3e-08 ***
## male1                  0.410787361  0.041653667    9.86  < 2e-16 ***
## friend_cnt            -0.004245446  0.000488720   -8.69  < 2e-16 ***
## avg_friend_age         0.029512585  0.004487202    6.58  4.8e-11 ***
## friend_country_cnt     0.044063934  0.003614547   12.19  < 2e-16 ***
## avg_friend_male        0.115931086  0.063478277    1.83    0.068 .  
## subscriber_friend_cnt  0.091497113  0.010734994    8.52  < 2e-16 ***
## songsListened          0.000007738  0.000000515   15.02  < 2e-16 ***
## lovedTracks            0.000699438  0.000049270   14.20  < 2e-16 ***
## playlists              0.058975070  0.013314780    4.43  9.5e-06 ***
## tenure                -0.004417671  0.001021436   -4.32  1.5e-05 ***
## good_country1         -0.416004121  0.040777587  -10.20  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 24537  on 43826  degrees of freedom
## Residual deviance: 22618  on 43814  degrees of freedom
## AIC: 22644
## 
## Number of Fisher Scoring iterations: 5
#Model comparison -> fit2 is better for simplicity
anova(fit1, fit2, test="Chisq")
#Multicollinearity -> all VIFs are under 5, the common rule-of-thumb cutoff
library(car)
vif(fit2)
##                   age                  male            friend_cnt 
##                2.0277                1.0609                4.1079 
##        avg_friend_age    friend_country_cnt       avg_friend_male 
##                2.0624                2.6171                1.0421 
## subscriber_friend_cnt         songsListened           lovedTracks 
##                2.9224                1.2657                1.1482 
##             playlists                tenure          good_country 
##                1.0441                1.2129                1.0295
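The chi-square comparison performed by anova(fit1, fit2, test="Chisq") can also be checked by hand; the sketch below uses the residual deviances and degrees of freedom printed in the two model summaries above:

```r
# Sketch: reproduce the likelihood-ratio (chi-square) test between fit1 and fit2
# from the residual deviances and degrees of freedom reported in the summaries.
dev_full   <- 22613; df_full   <- 43812  # fit1 (full model)
dev_simple <- 22618; df_simple <- 43814  # fit2 (reduced model)
lr_stat <- dev_simple - dev_full                                  # chi-square statistic = 5
p_value <- pchisq(lr_stat, df = df_simple - df_full, lower.tail = FALSE)
round(p_value, 3)  # ~0.082 > 0.05, so the dropped terms add little; keep fit2
```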

Model evaluation

  • McFadden’s Pseudo-R2 (similar to R square)

A common rule of thumb is that a McFadden's pseudo R-squared between 0.2 and 0.4 indicates very good model fit. The McFadden's pseudo-R2 of this model is only 0.0782, so the fit is relatively weak.

  • ROC curve

The AUC of the ROC curve is 0.739, which indicates reasonably good discriminatory power.

  • Confusion matrix and Cohen’s kappa.

The overall accuracy is 0.649, and Cohen's kappa of the confusion matrix is 0.128, meaning that the model improves on chance agreement by about 12.8% of the maximum possible improvement.

#Use McFadden's Pseudo-R2. The measure ranges from 0 to just under 1; values closer to zero indicate that the model has little predictive power.
#Details can be found at: https://www.r-bloggers.com/evaluating-logistic-regression-models/
library(pscl)
pR2(fit2)  
##           llh       llhNull            G2      McFadden          r2ML 
## -11309.019376 -12268.456570   1918.874387      0.078204      0.042838 
##          r2CU 
##      0.099924
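As a sanity check, McFadden's pseudo-R2 can be computed directly from the log-likelihoods that pR2() reports (llh and llhNull above):

```r
# McFadden's pseudo-R2 = 1 - logLik(fitted model) / logLik(null model),
# using the values reported by pR2(fit2) above.
llh      <- -11309.019376  # log-likelihood of fit2
llh_null <- -12268.456570  # log-likelihood of the intercept-only model
mcfadden <- 1 - llh / llh_null
round(mcfadden, 4)  # ~0.0782, matching the pR2() output
```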
#plot ROC curve
library(pROC)
g = roc(highnode$adopter ~ predict(fit2,type = "response"))
plot(g,print.thres = "best",print.auc=TRUE)

#find the best threshold
best_thres = coords(g, "best", ret=c("threshold", "specificity", "sensitivity"))
print(best_thres)
##   threshold specificity sensitivity 
##    0.072675    0.643548    0.705982
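Note that coords(g, "best") by default selects the threshold that maximizes Youden's J statistic (sensitivity + specificity - 1), which can be verified from the values printed above:

```r
# Youden's J at the chosen threshold, from the coords() output above.
j <- 0.705982 + 0.643548 - 1
round(j, 3)  # ~0.35, the maximum over all candidate thresholds on this ROC curve
```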
#confusion matrix and Kappa
library(caret)
pred_class = factor(as.numeric(predict(fit2, newdata = highnode, type = "response") > best_thres[1]))
confusionMatrix(pred_class, reference = factor(highnode$adopter))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 25935  1037
##          1 14365  2490
##                                         
##                Accuracy : 0.649         
##                  95% CI : (0.644, 0.653)
##     No Information Rate : 0.92          
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0.128         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.644         
##             Specificity : 0.706         
##          Pos Pred Value : 0.962         
##          Neg Pred Value : 0.148         
##              Prevalence : 0.920         
##          Detection Rate : 0.592         
##    Detection Prevalence : 0.615         
##       Balanced Accuracy : 0.675         
##                                         
##        'Positive' Class : 0             
## 
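The accuracy and Cohen's kappa can also be recomputed by hand from the confusion-matrix counts above (a quick sketch; the counts are copied from the printed table):

```r
# Recompute accuracy and Cohen's kappa from the confusion-matrix counts above.
tab <- matrix(c(25935, 14365, 1037, 2490), nrow = 2)  # rows = predicted, cols = reference
n        <- sum(tab)
acc      <- sum(diag(tab)) / n                        # observed agreement
exp_acc  <- sum(rowSums(tab) * colSums(tab)) / n^2    # expected chance agreement
kappa    <- (acc - exp_acc) / (1 - exp_acc)
round(c(accuracy = acc, kappa = kappa), 3)  # ~0.649 and ~0.128, matching caret
```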
#library(measures)
#KAPPA(highnode$adopter, #truth
#      as.numeric(predict(fit2, newdata = highnode, type = "response") > best_thres[1])) #predicted values 

Coefficient Interpretation

Generally, the following variables have a positive effect on the likelihood of becoming an adopter:

  • age, male, avg_friend_age, friend_country_cnt, avg_friend_male, subscriber_friend_cnt, songsListened, lovedTracks, and playlists.

The following variables have a negative effect on the likelihood of becoming an adopter:

  • friend_cnt, tenure, and good_country.

However, friend_cnt, songsListened, lovedTracks, and tenure may not be economically significant because their coefficients are so small. For example, if a user listens to one more song, the odds of becoming an adopter (p/(1-p)) increase by a factor of only about 1.000008, which is negligible.

Three kinds of variables deserve particular attention because of their large coefficients: gender (male and avg_friend_male), subscriber_friend_cnt, and good_country.

#odds ratio
exp(coef(fit2))
##           (Intercept)                   age                 male1 
##              0.015329              1.019643              1.508005 
##            friend_cnt        avg_friend_age    friend_country_cnt 
##              0.995764              1.029952              1.045049 
##       avg_friend_male subscriber_friend_cnt         songsListened 
##              1.122918              1.095814              1.000008 
##           lovedTracks             playlists                tenure 
##              1.000700              1.060749              0.995592 
##         good_country1 
##              0.659678
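Because the per-song odds ratio above is so close to 1, it is easier to interpret songsListened on a larger scale. The sketch below rescales the coefficient reported in the summary to a per-1,000-songs effect:

```r
# Rescale the tiny per-song effect of songsListened to a per-1,000-songs
# odds multiplier, using the coefficient from the summary of fit2 above.
b_songs  <- 0.000007738           # coefficient for songsListened
per_song <- exp(b_songs)          # odds multiplier per additional song (~1.000008)
per_1000 <- exp(1000 * b_songs)   # odds multiplier per 1,000 additional songs
round(per_1000, 4)  # ~1.0078, i.e. about a 0.8% increase in the odds
```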

Takeaways

The descriptive statistics show that a typical adopter is an older male with many male friends. The regression analysis likewise shows that being older and male (both male and avg_friend_male) increases the probability of becoming an adopter. As a result, HighNote's "free to fee" strategy should focus on older male users, since they are the target segment. The company could display more ads to this group and develop more features that satisfy their needs, encouraging them to pay.

The data visualization shows that the adopter group scores higher than the non-adopter group on every aspect of user engagement, meaning that the more a user engages with the site, the more likely he or she is to become an adopter. The company should therefore encourage users to interact with the site more frequently, for example by recommending more songs.

The company could also consider extending its business globally, because the regression analysis shows that users who are not from the US, UK, or Germany are more willing to pay. For example, if HighNote attracts more Asian users, the overall subscription rate should tend to increase.